In-class Exercise 5: Modeling the Spatial Variation of the Explanatory Factors of Water Point Status using Geographically Weighted Logistic Regression (GWLR)

Author

Xu Siyi

Overview

Study area: Osun State, Nigeria

Osun.rds: LGA boundaries of Osun State

Osun_wp_sf.rds: water points within Osun State, stored as an sf point data frame

Model Variables

Dependent Variable: Water point status (functional/non-functional)

Independent variables:

  • distance_to_primary_road (continuous)

  • distance_to_secondary_road (continuous)

  • distance_to_tertiary_road (continuous)

  • distance_to_city (continuous)

  • distance_to_town (continuous)

  • water_point_population (continuous)

  • local_population_1km (continuous)

  • usage_capacity (categorical)

  • is_urban (categorical)

  • water_source_clean (categorical)

Data Import

pacman::p_load(sf, spdep, tmap, tidyverse, dplyr, funModeling, rgdal, blorr, corrplot, ggpubr, GWmodel, skimr, caret)

Importing the Analytic Data

osun_wp_sf <- read_rds("rds/Osun_wp_sf.rds") 
osun <- read_rds("rds/Osun.rds") 

osun_wp_sf %>%
  freq(input='status')
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the funModeling package.
  Please report the issue at <https://github.com/pablo14/funModeling/issues>.

  status frequency percentage cumulative_perc
1   TRUE      2642       55.5            55.5
2  FALSE      2118       44.5           100.0

Of the 4,760 water points in Osun State, 55.5% are functional and 44.5% are non-functional.
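The class shares can be checked directly from the frequency counts reported by freq(). This is a minimal base-R sketch; the object names status_counts and status_pct are illustrative and not part of the original workflow.

```r
# Counts taken from the freq() output above; the object names are illustrative.
status_counts <- c(functional = 2642, non_functional = 2118)

# Share of each class as a percentage, rounded to one decimal place.
status_pct <- round(100 * prop.table(status_counts), 1)
status_pct
```

The two classes are reasonably balanced, so a 0.5 probability cut-off is a sensible starting point for the confusion matrices later in the exercise.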

tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(osun)+
  tm_polygons(alpha=0.4)+
tm_shape(osun_wp_sf)+
  tm_dots(col="status",
          alpha=0.6)+
  tm_view(set.zoom.limits = c(9,12))

EDA

Summary statistics with skim()

osun_wp_sf %>%
  skim()
Warning: Couldn't find skimmers for class: sfc_POINT, sfc; No user-defined `sfl`
provided. Falling back to `character`.
Data summary
Name Piped data
Number of rows 4760
Number of columns 75
_______________________
Column type frequency:
character 47
logical 5
numeric 23
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1.00 5 44 0 2 0
report_date 0 1.00 22 22 0 42 0
status_id 0 1.00 2 7 0 3 0
water_source_clean 0 1.00 8 22 0 3 0
water_source_category 0 1.00 4 6 0 2 0
water_tech_clean 24 0.99 9 23 0 3 0
water_tech_category 24 0.99 9 15 0 2 0
facility_type 0 1.00 8 8 0 1 0
clean_country_name 0 1.00 7 7 0 1 0
clean_adm1 0 1.00 3 5 0 5 0
clean_adm2 0 1.00 3 14 0 35 0
clean_adm3 4760 0.00 NA NA 0 0 0
clean_adm4 4760 0.00 NA NA 0 0 0
installer 4760 0.00 NA NA 0 0 0
management_clean 1573 0.67 5 37 0 7 0
status_clean 0 1.00 9 32 0 7 0
pay 0 1.00 2 39 0 7 0
fecal_coliform_presence 4760 0.00 NA NA 0 0 0
subjective_quality 0 1.00 18 20 0 4 0
activity_id 4757 0.00 36 36 0 3 0
scheme_id 4760 0.00 NA NA 0 0 0
wpdx_id 0 1.00 12 12 0 4760 0
notes 0 1.00 2 96 0 3502 0
orig_lnk 4757 0.00 84 84 0 1 0
photo_lnk 41 0.99 84 84 0 4719 0
country_id 0 1.00 2 2 0 1 0
data_lnk 0 1.00 79 96 0 2 0
water_point_history 0 1.00 142 834 0 4750 0
clean_country_id 0 1.00 3 3 0 1 0
country_name 0 1.00 7 7 0 1 0
water_source 0 1.00 8 30 0 4 0
water_tech 0 1.00 5 37 0 20 0
adm2 0 1.00 3 14 0 33 0
adm3 4760 0.00 NA NA 0 0 0
management 1573 0.67 5 47 0 7 0
adm1 0 1.00 4 5 0 4 0
New Georeferenced Column 0 1.00 16 35 0 4760 0
lat_lon_deg 0 1.00 13 32 0 4760 0
public_data_source 0 1.00 84 102 0 2 0
converted 0 1.00 53 53 0 1 0
created_timestamp 0 1.00 22 22 0 2 0
updated_timestamp 0 1.00 22 22 0 2 0
Geometry 0 1.00 33 37 0 4760 0
ADM2_EN 0 1.00 3 14 0 30 0
ADM2_PCODE 0 1.00 8 8 0 30 0
ADM1_EN 0 1.00 4 4 0 1 0
ADM1_PCODE 0 1.00 5 5 0 1 0

Variable type: logical

skim_variable n_missing complete_rate mean count
rehab_year 4760 0 NaN :
rehabilitator 4760 0 NaN :
is_urban 0 1 0.39 FAL: 2884, TRU: 1876
latest_record 0 1 1.00 TRU: 4760
status 0 1 0.56 TRU: 2642, FAL: 2118

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
row_id 0 1.00 68550.48 10216.94 49601.00 66874.75 68244.50 69562.25 471319.00 ▇▁▁▁▁
lat_deg 0 1.00 7.68 0.22 7.06 7.51 7.71 7.88 8.06 ▁▂▇▇▇
lon_deg 0 1.00 4.54 0.21 4.08 4.36 4.56 4.71 5.06 ▃▆▇▇▂
install_year 1144 0.76 2008.63 6.04 1917.00 2006.00 2010.00 2013.00 2015.00 ▁▁▁▁▇
fecal_coliform_value 4760 0.00 NaN NA NA NA NA NA NA
distance_to_primary_road 0 1.00 5021.53 5648.34 0.01 719.36 2972.78 7314.73 26909.86 ▇▂▁▁▁
distance_to_secondary_road 0 1.00 3750.47 3938.63 0.15 460.90 2554.25 5791.94 19559.48 ▇▃▁▁▁
distance_to_tertiary_road 0 1.00 1259.28 1680.04 0.02 121.25 521.77 1834.42 10966.27 ▇▂▁▁▁
distance_to_city 0 1.00 16663.99 10960.82 53.05 7930.75 15030.41 24255.75 47934.34 ▇▇▆▃▁
distance_to_town 0 1.00 16726.59 12452.65 30.00 6876.92 12204.53 27739.46 44020.64 ▇▅▃▃▂
rehab_priority 2654 0.44 489.33 1658.81 0.00 7.00 91.50 376.25 29697.00 ▇▁▁▁▁
water_point_population 4 1.00 513.58 1458.92 0.00 14.00 119.00 433.25 29697.00 ▇▁▁▁▁
local_population_1km 4 1.00 2727.16 4189.46 0.00 176.00 1032.00 3717.00 36118.00 ▇▁▁▁▁
crucialness_score 798 0.83 0.26 0.28 0.00 0.07 0.15 0.35 1.00 ▇▃▁▁▁
pressure_score 798 0.83 1.46 4.16 0.00 0.12 0.41 1.24 93.69 ▇▁▁▁▁
usage_capacity 0 1.00 560.74 338.46 300.00 300.00 300.00 1000.00 1000.00 ▇▁▁▁▅
days_since_report 0 1.00 2692.69 41.92 1483.00 2688.00 2693.00 2700.00 4645.00 ▁▇▁▁▁
staleness_score 0 1.00 42.80 0.58 23.13 42.70 42.79 42.86 62.66 ▁▁▇▁▁
location_id 0 1.00 235865.49 6657.60 23741.00 230638.75 236199.50 240061.25 267454.00 ▁▁▁▁▇
cluster_size 0 1.00 1.05 0.25 1.00 1.00 1.00 1.00 4.00 ▇▁▁▁▁
lat_deg_original 4760 0.00 NaN NA NA NA NA NA NA
lon_deg_original 4760 0.00 NaN NA NA NA NA NA NA
count 0 1.00 1.00 0.00 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁

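The install_year column has 1,144 missing values, so it is excluded from the analysis. The code below also drops the rows that have a missing value in any of the model variables and converts usage_capacity to a factor.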

osun_wp_sf_clean<-osun_wp_sf %>%
  filter_at(vars(status,
                 distance_to_primary_road,
                 distance_to_secondary_road,
                 distance_to_tertiary_road,
                 distance_to_city,
                 distance_to_town,
                 water_point_population,
                 local_population_1km,
                 usage_capacity,
                 is_urban,
                 water_source_clean),
            all_vars(!is.na(.))) %>%
  mutate(usage_capacity=as.factor(usage_capacity))
osun_wp_sf_clean %>%
  freq(input='status')

  status frequency percentage cumulative_perc
1   TRUE      2642      55.55           55.55
2  FALSE      2114      44.45          100.00
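The row filter above removes 4 of the 4,760 records, which matches the 4 missing values that skim() reports for water_point_population and local_population_1km. The same idea can be sketched in base R on a toy table; the data values and the names toy_wp, model_vars and toy_clean are made up for illustration.

```r
# Toy data frame standing in for the water point table (values are made up).
toy_wp <- data.frame(
  status = c(TRUE, FALSE, TRUE, FALSE),
  water_point_population = c(120, NA, 45, 300),
  local_population_1km = c(900, 1500, NA, 2100),
  install_year = c(NA, 2010, 2012, NA)  # not a model variable, so its NAs are ignored
)

# Only the model variables decide whether a row is kept.
model_vars <- c("status", "water_point_population", "local_population_1km")

# Keep rows where every model variable is observed.
toy_clean <- toy_wp[complete.cases(toy_wp[, model_vars]), ]

# Number of rows dropped by the filter.
nrow(toy_wp) - nrow(toy_clean)
```

Restricting the check to the model variables matters here: filtering on every column would discard all rows, because several columns in the real data are entirely missing.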
osun_wp_sf_clean %>%
  skim()
Warning: Couldn't find skimmers for class: sfc_POINT, sfc; No user-defined `sfl`
provided. Falling back to `character`.
Data summary
Name Piped data
Number of rows 4756
Number of columns 75
_______________________
Column type frequency:
character 47
factor 1
logical 5
numeric 22
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1.00 5 44 0 2 0
report_date 0 1.00 22 22 0 42 0
status_id 0 1.00 2 7 0 3 0
water_source_clean 0 1.00 8 22 0 3 0
water_source_category 0 1.00 4 6 0 2 0
water_tech_clean 23 1.00 9 23 0 3 0
water_tech_category 23 1.00 9 15 0 2 0
facility_type 0 1.00 8 8 0 1 0
clean_country_name 0 1.00 7 7 0 1 0
clean_adm1 0 1.00 3 5 0 5 0
clean_adm2 0 1.00 3 14 0 35 0
clean_adm3 4756 0.00 NA NA 0 0 0
clean_adm4 4756 0.00 NA NA 0 0 0
installer 4756 0.00 NA NA 0 0 0
management_clean 1569 0.67 5 37 0 7 0
status_clean 0 1.00 9 32 0 6 0
pay 0 1.00 2 39 0 7 0
fecal_coliform_presence 4756 0.00 NA NA 0 0 0
subjective_quality 0 1.00 18 20 0 4 0
activity_id 4753 0.00 36 36 0 3 0
scheme_id 4756 0.00 NA NA 0 0 0
wpdx_id 0 1.00 12 12 0 4756 0
notes 0 1.00 2 96 0 3499 0
orig_lnk 4753 0.00 84 84 0 1 0
photo_lnk 41 0.99 84 84 0 4715 0
country_id 0 1.00 2 2 0 1 0
data_lnk 0 1.00 79 96 0 2 0
water_point_history 0 1.00 142 834 0 4746 0
clean_country_id 0 1.00 3 3 0 1 0
country_name 0 1.00 7 7 0 1 0
water_source 0 1.00 8 30 0 4 0
water_tech 0 1.00 5 37 0 19 0
adm2 0 1.00 3 14 0 33 0
adm3 4756 0.00 NA NA 0 0 0
management 1569 0.67 5 47 0 7 0
adm1 0 1.00 4 5 0 4 0
New Georeferenced Column 0 1.00 16 35 0 4756 0
lat_lon_deg 0 1.00 13 32 0 4756 0
public_data_source 0 1.00 84 102 0 2 0
converted 0 1.00 53 53 0 1 0
created_timestamp 0 1.00 22 22 0 2 0
updated_timestamp 0 1.00 22 22 0 2 0
Geometry 0 1.00 33 37 0 4756 0
ADM2_EN 0 1.00 3 14 0 30 0
ADM2_PCODE 0 1.00 8 8 0 30 0
ADM1_EN 0 1.00 4 4 0 1 0
ADM1_PCODE 0 1.00 5 5 0 1 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
usage_capacity 0 1 FALSE 2 300: 2986, 100: 1770

Variable type: logical

skim_variable n_missing complete_rate mean count
rehab_year 4756 0 NaN :
rehabilitator 4756 0 NaN :
is_urban 0 1 0.39 FAL: 2882, TRU: 1874
latest_record 0 1 1.00 TRU: 4756
status 0 1 0.56 TRU: 2642, FAL: 2114

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
row_id 0 1.00 68551.03 10221.05 49601.00 66875.75 68244.50 69562.25 471319.00 ▇▁▁▁▁
lat_deg 0 1.00 7.68 0.22 7.06 7.51 7.71 7.88 8.06 ▁▂▇▇▇
lon_deg 0 1.00 4.54 0.21 4.08 4.36 4.56 4.71 5.06 ▃▆▇▇▂
install_year 1143 0.76 2008.63 6.04 1917.00 2006.00 2010.00 2013.00 2015.00 ▁▁▁▁▇
fecal_coliform_value 4756 0.00 NaN NA NA NA NA NA NA
distance_to_primary_road 0 1.00 5021.73 5650.02 0.01 719.36 2968.38 7314.73 26909.86 ▇▂▁▁▁
distance_to_secondary_road 0 1.00 3751.00 3939.74 0.15 460.50 2554.25 5791.94 19559.48 ▇▃▁▁▁
distance_to_tertiary_road 0 1.00 1259.65 1680.52 0.02 121.33 521.77 1834.42 10966.27 ▇▂▁▁▁
distance_to_city 0 1.00 16662.78 10961.08 53.05 7930.75 15020.40 24255.75 47934.34 ▇▇▆▃▁
distance_to_town 0 1.00 16732.33 12455.76 30.00 6876.92 12215.09 27745.52 44020.64 ▇▅▃▃▂
rehab_priority 2650 0.44 489.33 1658.81 0.00 7.00 91.50 376.25 29697.00 ▇▁▁▁▁
water_point_population 0 1.00 513.58 1458.92 0.00 14.00 119.00 433.25 29697.00 ▇▁▁▁▁
local_population_1km 0 1.00 2727.16 4189.46 0.00 176.00 1032.00 3717.00 36118.00 ▇▁▁▁▁
crucialness_score 794 0.83 0.26 0.28 0.00 0.07 0.15 0.35 1.00 ▇▃▁▁▁
pressure_score 794 0.83 1.46 4.16 0.00 0.12 0.41 1.24 93.69 ▇▁▁▁▁
days_since_report 0 1.00 2692.69 41.94 1483.00 2688.00 2693.00 2700.00 4645.00 ▁▇▁▁▁
staleness_score 0 1.00 42.80 0.58 23.13 42.70 42.79 42.86 62.66 ▁▁▇▁▁
location_id 0 1.00 235864.87 6659.44 23741.00 230638.75 236198.50 240062.25 267454.00 ▁▁▁▁▇
cluster_size 0 1.00 1.05 0.25 1.00 1.00 1.00 1.00 4.00 ▇▁▁▁▁
lat_deg_original 4756 0.00 NaN NA NA NA NA NA NA
lon_deg_original 4756 0.00 NaN NA NA NA NA NA NA
count 0 1.00 1.00 0.00 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁

Correlation Analysis

osun_wp <- osun_wp_sf_clean %>%
  select(c(7,35:39,42:43,46:47,57))%>%
  st_set_geometry(NULL) # drop the geometry column
cluster_vars.cor=cor(
  osun_wp[,2:7])
corrplot.mixed(cluster_vars.cor,
               lower="ellipse",
               upper="number",
               tl.pos = "lt",
               diag="l",
               tl.col="black")
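Besides reading the plot, highly correlated pairs can be listed programmatically. The sketch below uses simulated data, and the 0.8 cut-off is an assumed rule of thumb rather than a value taken from this exercise; the names toy, cor_mat and flagged are illustrative.

```r
# Simulated predictors: x2 is built to be almost collinear with x1.
set.seed(42)
toy <- data.frame(x1 = rnorm(100))
toy$x2 <- toy$x1 + rnorm(100, sd = 0.1)
toy$x3 <- rnorm(100)

cor_mat <- cor(toy)

# Look only at the upper triangle so each pair is reported once.
high <- which(abs(cor_mat) > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)

flagged <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r = cor_mat[high]
)
flagged
```

Applied to the real data, cor_mat would be replaced by cluster_vars.cor, and any flagged pair would be a candidate for dropping one of its variables before model fitting.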

Building a logistic regression model

model<-glm(status ~ distance_to_primary_road+
                 distance_to_secondary_road+
                 distance_to_tertiary_road+
                 distance_to_city+
                 distance_to_town+
                 water_point_population+
                 local_population_1km+
                 usage_capacity+
                 is_urban+
                 water_source_clean,
           data=osun_wp_sf_clean,
           family=binomial(link='logit'))
blr_regress(model)
                             Model Overview                              
------------------------------------------------------------------------
Data Set    Resp Var    Obs.    Df. Model    Df. Residual    Convergence 
------------------------------------------------------------------------
  data       status     4756      4755           4744           TRUE     
------------------------------------------------------------------------

                    Response Summary                     
--------------------------------------------------------
Outcome        Frequency        Outcome        Frequency 
--------------------------------------------------------
   0             2114              1             2642    
--------------------------------------------------------

                                 Maximum Likelihood Estimates                                   
-----------------------------------------------------------------------------------------------
               Parameter                    DF    Estimate    Std. Error    z value     Pr(>|z|) 
-----------------------------------------------------------------------------------------------
              (Intercept)                   1      0.3887        0.1124      3.4588       5e-04 
        distance_to_primary_road            1      0.0000        0.0000     -0.7153      0.4744 
       distance_to_secondary_road           1      0.0000        0.0000     -0.5530      0.5802 
       distance_to_tertiary_road            1      1e-04         0.0000      4.6708      0.0000 
            distance_to_city                1      0.0000        0.0000     -4.7574      0.0000 
            distance_to_town                1      0.0000        0.0000     -4.9170      0.0000 
         water_point_population             1      -5e-04        0.0000    -11.3686      0.0000 
          local_population_1km              1      3e-04         0.0000     19.2953      0.0000 
           usage_capacity1000               1     -0.6230        0.0697     -8.9366      0.0000 
              is_urbanTRUE                  1     -0.2971        0.0819     -3.6294       3e-04 
water_source_cleanProtected Shallow Well    1      0.5040        0.0857      5.8783      0.0000 
   water_source_cleanProtected Spring       1      1.2882        0.4388      2.9359      0.0033 
-----------------------------------------------------------------------------------------------

 Association of Predicted Probabilities and Observed Responses  
---------------------------------------------------------------
% Concordant          0.7347          Somers' D        0.4693   
% Discordant          0.2653          Gamma            0.4693   
% Tied                0.0000          Tau-a            0.2318   
Pairs                5585188          c                0.7347   
---------------------------------------------------------------

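All predictors except distance_to_primary_road and distance_to_secondary_road are statistically significant at the 0.05 level: distance_to_tertiary_road, distance_to_city, distance_to_town, water_point_population, local_population_1km, usage_capacity, is_urban and water_source_clean.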
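The estimates in the table are on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios for a water point being functional. A minimal sketch using the categorical-term estimates copied from the output above; the object names are illustrative.

```r
# Estimates copied from the regression output above (log-odds scale).
log_odds <- c(
  usage_capacity1000 = -0.6230,
  is_urbanTRUE = -0.2971,
  "water_source_cleanProtected Shallow Well" = 0.5040,
  "water_source_cleanProtected Spring" = 1.2882
)

# Exponentiating a logit coefficient gives an odds ratio.
odds_ratios <- exp(log_odds)
round(odds_ratios, 3)
```

An odds ratio below 1 (for example usage_capacity1000) indicates lower odds of being functional than the reference level, holding the other predictors constant; a value above 1 indicates higher odds.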

#report(model)
blr_confusion_matrix(model,cutoff=0.5)
Confusion Matrix and Statistics 

          Reference
Prediction FALSE TRUE
         0  1301  738
         1   813 1904

                Accuracy : 0.6739 
     No Information Rate : 0.4445 

                   Kappa : 0.3373 

McNemars's Test P-Value  : 0.0602 

             Sensitivity : 0.7207 
             Specificity : 0.6154 
          Pos Pred Value : 0.7008 
          Neg Pred Value : 0.6381 
              Prevalence : 0.5555 
          Detection Rate : 0.4003 
    Detection Prevalence : 0.5713 
       Balanced Accuracy : 0.6680 
               Precision : 0.7008 
                  Recall : 0.7207 

        'Positive' Class : 1
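The headline statistics can be reproduced by hand from the four cell counts, which helps when checking what each metric means. A minimal base-R sketch with the counts copied from the matrix above; the object names are illustrative.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column).
cm <- matrix(
  c(1301, 813, 738, 1904),
  nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("FALSE", "TRUE"))
)

# Positive class is 1 (functional).
accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["1", "TRUE"] / sum(cm[, "TRUE"])
specificity <- cm["0", "FALSE"] / sum(cm[, "FALSE"])

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```

The rounded values agree with the Accuracy, Sensitivity and Specificity lines reported by blr_confusion_matrix().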

Building a Geographically Weighted Logistic Regression (GWLR) model

Converting from sf to sp data frame

osun_wp_sp<-osun_wp_sf_clean %>%
  select(c(status,
                 distance_to_primary_road,
                 distance_to_secondary_road,
                 distance_to_tertiary_road,
                 distance_to_city,
                 distance_to_town,
                 water_point_population,
                 local_population_1km,
                 usage_capacity,
                 is_urban,
                 water_source_clean)) %>%
  as_Spatial()
osun_wp_sp
class       : SpatialPointsDataFrame 
features    : 4756 
extent      : 182502.4, 290751, 340054.1, 450905.3  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=4 +lon_0=8.5 +k=0.99975 +x_0=670553.98 +y_0=0 +a=6378249.145 +rf=293.465 +towgs84=-92,-93,122,0,0,0,0 +units=m +no_defs 
variables   : 11
names       : status, distance_to_primary_road, distance_to_secondary_road, distance_to_tertiary_road, distance_to_city, distance_to_town, water_point_population, local_population_1km, usage_capacity, is_urban, water_source_clean 
min values  :      0,        0.014461356813335,          0.152195902540837,         0.017815121653488, 53.0461399623541, 30.0019777713073,                      0,                    0,           1000,        0,           Borehole 
max values  :      1,         26909.8616132094,           19559.4793799085,          10966.2705628969,  47934.343603562, 44020.6393368124,                  29697,                36118,            300,        1,   Protected Spring 

Building a Fixed Bandwidth GWLR Model

Computing fixed bandwidth

bw.fixed<- bw.ggwr(status~distance_to_primary_road+
                 distance_to_secondary_road+
                 distance_to_tertiary_road+
                 distance_to_city+
                 distance_to_town+
                 water_point_population+
                 local_population_1km+
                 usage_capacity+
                 is_urban+
                 water_source_clean,
              data=osun_wp_sp,
              family="binomial",
              approach = "AIC",
              kernel = "gaussian",
              adaptive = FALSE,
              longlat = FALSE)
bw.fixed
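To see what a fixed bandwidth means in practice, the sketch below computes Gaussian kernel weights at a few distances. It assumes the usual Gaussian form w = exp(-0.5 * (d / bw)^2) for kernel = "gaussian"; the function name gaussian_weight and the distances are illustrative, and 2597.255 is the bandwidth value passed to ggwr.basic() in this exercise.

```r
# Gaussian kernel weight for an observation at distance d (metres) from the
# regression point, given a fixed bandwidth bw (assumed standard form).
gaussian_weight <- function(d, bw) exp(-0.5 * (d / bw)^2)

bw <- 2597.255                     # fixed bandwidth used in this exercise
d  <- c(0, 1000, bw, 5000, 10000)  # illustrative distances in metres

weights <- gaussian_weight(d, bw)
round(weights, 4)
```

With adaptive = FALSE the same bandwidth is applied at every regression point, so water points in sparsely sampled areas are fitted from fewer effectively weighted neighbours than those in dense clusters.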
gwlr.fixed<-ggwr.basic(status~distance_to_primary_road+
                 distance_to_secondary_road+
                 distance_to_tertiary_road+
                 distance_to_city+
                 distance_to_town+
                 water_point_population+
                 local_population_1km+
                 usage_capacity+
                 is_urban+
                 water_source_clean,
              data=osun_wp_sp,
              bw=2597.255,
              family="binomial",
              kernel="gaussian",
              adaptive=FALSE,
              longlat=FALSE)
 Iteration    Log-Likelihood
=========================
       0        -1957 
       1        -1675 
       2        -1525 
       3        -1441 
       4        -1403 
       5        -1403 
gwlr.fixed
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2022-12-17 17:38:35 
   Call:
   ggwr.basic(formula = status ~ distance_to_primary_road + distance_to_secondary_road + 
    distance_to_tertiary_road + distance_to_city + distance_to_town + 
    water_point_population + local_population_1km + usage_capacity + 
    is_urban + water_source_clean, data = osun_wp_sp, bw = 2597.255, 
    family = "binomial", kernel = "gaussian", adaptive = FALSE, 
    longlat = FALSE)

   Dependent (y) variable:  status
   Independent variables:  distance_to_primary_road distance_to_secondary_road distance_to_tertiary_road distance_to_city distance_to_town water_point_population local_population_1km usage_capacity is_urban water_source_clean
   Number of data points: 4756
   Used family: binomial
   ***********************************************************************
   *              Results of Generalized linear Regression               *
   ***********************************************************************

Call:
NULL

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-124.555    -1.755     1.072     1.742    34.333  

Coefficients:
                                           Estimate Std. Error z value Pr(>|z|)
Intercept                                 3.887e-01  1.124e-01   3.459 0.000543
distance_to_primary_road                 -4.642e-06  6.490e-06  -0.715 0.474422
distance_to_secondary_road               -5.143e-06  9.299e-06  -0.553 0.580230
distance_to_tertiary_road                 9.683e-05  2.073e-05   4.671 3.00e-06
distance_to_city                         -1.686e-05  3.544e-06  -4.757 1.96e-06
distance_to_town                         -1.480e-05  3.009e-06  -4.917 8.79e-07
water_point_population                   -5.097e-04  4.484e-05 -11.369  < 2e-16
local_population_1km                      3.451e-04  1.788e-05  19.295  < 2e-16
usage_capacity1000                       -6.230e-01  6.972e-02  -8.937  < 2e-16
is_urbanTRUE                             -2.971e-01  8.185e-02  -3.629 0.000284
water_source_cleanProtected Shallow Well  5.040e-01  8.574e-02   5.878 4.14e-09
water_source_cleanProtected Spring        1.288e+00  4.388e-01   2.936 0.003325
                                            
Intercept                                ***
distance_to_primary_road                    
distance_to_secondary_road                  
distance_to_tertiary_road                ***
distance_to_city                         ***
distance_to_town                         ***
water_point_population                   ***
local_population_1km                     ***
usage_capacity1000                       ***
is_urbanTRUE                             ***
water_source_cleanProtected Shallow Well ***
water_source_cleanProtected Spring       ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6534.5  on 4755  degrees of freedom
Residual deviance: 5688.0  on 4744  degrees of freedom
AIC: 5712

Number of Fisher Scoring iterations: 5


 AICc:  5712.099
 Pseudo R-square value:  0.1295351
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Fixed bandwidth: 2597.255 
   Regression points: the same locations as observations are used.
   Distance metric: A distance matrix is specified for this model calibration.

   ************Summary of Generalized GWR coefficient estimates:**********
                                                   Min.     1st Qu.      Median
   Intercept                                -8.9630e+02 -4.9805e+00  1.7599e+00
   distance_to_primary_road                 -1.9477e-02 -4.8092e-04  3.0174e-05
   distance_to_secondary_road               -1.5757e-02 -3.7583e-04  1.2438e-04
   distance_to_tertiary_road                -1.5673e-02 -4.2538e-04  7.6217e-05
   distance_to_city                         -1.8447e-02 -5.6287e-04 -1.2745e-04
   distance_to_town                         -2.2450e-02 -5.7335e-04 -1.5218e-04
   water_point_population                   -5.2830e-02 -2.2810e-03 -9.8829e-04
   local_population_1km                     -1.2757e-01  5.0016e-04  1.0647e-03
   usage_capacity1000                       -2.0846e+01 -9.7311e-01 -4.1596e-01
   is_urbanTRUE                             -1.9866e+02 -4.3054e+00 -1.6908e+00
   water_source_cleanProtected.Shallow.Well -2.0782e+01 -4.5536e-01  5.3278e-01
   water_source_cleanProtected.Spring       -5.2495e+02 -5.5983e+00  2.5500e+00
                                                3rd Qu.      Max.
   Intercept                                 1.2829e+01 1075.4234
   distance_to_primary_road                  4.8497e-04    0.0143
   distance_to_secondary_road                6.0665e-04    0.0259
   distance_to_tertiary_road                 6.7104e-04    0.0129
   distance_to_city                          2.3763e-04    0.0155
   distance_to_town                          1.9318e-04    0.0225
   water_point_population                    5.0564e-04    0.1313
   local_population_1km                      1.8177e-03    0.0392
   usage_capacity1000                        3.0334e-01    5.9492
   is_urbanTRUE                              1.2864e+00  746.9498
   water_source_cleanProtected.Shallow.Well  1.7870e+00   67.5549
   water_source_cleanProtected.Spring        6.7736e+00  331.1243
   ************************Diagnostic information*************************
   Number of data points: 4756 
   GW Deviance: 2792.323 
   AIC : 4413.603 
   AICc : 4747.217 
   Pseudo R-square value:  0.5726785 

   ***********************************************************************
   Program stops at: 2022-12-17 17:39:12 
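The printout contains diagnostics for both the global model and the GWLR model, and placing them side by side makes the comparison easier. A minimal sketch with the values copied from the output above; the object names are illustrative.

```r
# Diagnostics copied from the two sections of the ggwr.basic() printout.
diagnostics <- data.frame(
  model = c("global logistic regression", "GWLR (fixed bandwidth)"),
  AICc = c(5712.099, 4747.217),
  pseudo_r2 = c(0.1295351, 0.5726785)
)

# A lower AICc indicates a better trade-off between fit and complexity.
aicc_drop <- diagnostics$AICc[1] - diagnostics$AICc[2]
aicc_drop
diagnostics
```

The GWLR model has the lower AICc and the higher pseudo R-square, which suggests that allowing the coefficients to vary over space improves the fit for this data set.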

Model Assessment

Converting the SDF into a data frame

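# Extract the GWLR results (SDF) as a plain data frame
gwr.fixed <- as.data.frame(gwlr.fixed$SDF)
# Label a water point as functional when its fitted probability is at least 0.5
gwr.fixed <- gwr.fixed %>%
  mutate(most = yhat >= 0.5)
gwr.fixed$y <- as.factor(gwr.fixed$y)
gwr.fixed$most <- as.factor(gwr.fixed$most)
CM <- confusionMatrix(data = gwr.fixed$most, reference = gwr.fixed$y)
CM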
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE  1824  263
     TRUE    290 2379
                                          
               Accuracy : 0.8837          
                 95% CI : (0.8743, 0.8927)
    No Information Rate : 0.5555          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.7642          
                                          
 Mcnemar's Test P-Value : 0.2689          
                                          
            Sensitivity : 0.8628          
            Specificity : 0.9005          
         Pos Pred Value : 0.8740          
         Neg Pred Value : 0.8913          
             Prevalence : 0.4445          
         Detection Rate : 0.3835          
   Detection Prevalence : 0.4388          
      Balanced Accuracy : 0.8816          
                                          
       'Positive' Class : FALSE           
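The thresholding step behind this matrix can be sketched on a toy example in base R. The probabilities and labels below are made up for illustration. Note that the positive class is FALSE, as in the output above, so sensitivity here measures how well non-functional water points are detected.

```r
# Made-up fitted probabilities and observed status for six water points.
yhat <- c(0.9, 0.2, 0.6, 0.4, 0.7, 0.1)
y    <- c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)

# Classify as functional when the fitted probability is at least 0.5.
pred <- yhat >= 0.5

# Fix the level order so the table layout is stable.
lv <- c(FALSE, TRUE)
cm_toy <- table(Prediction = factor(pred, levels = lv),
                Reference  = factor(y, levels = lv))

accuracy_toy    <- sum(diag(cm_toy)) / sum(cm_toy)
# Positive class FALSE: correctly predicted FALSE over all observed FALSE.
sensitivity_toy <- cm_toy["FALSE", "FALSE"] / sum(cm_toy[, "FALSE"])

cm_toy
c(accuracy = accuracy_toy, sensitivity = sensitivity_toy)
```

Because the global model's matrix treats 1 (functional) as the positive class, sensitivity and specificity swap roles between the two outputs; accuracy is the metric that can be compared directly.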
                                          

Visualizing GWLR

osun_wp_sf_selected<-osun_wp_sf_clean%>%
  select(c(ADM2_EN,ADM2_PCODE,ADM1_EN,ADM1_PCODE,status
           ))
gwr_sf.fixed<-cbind(osun_wp_sf_selected,gwr.fixed)
tmap_mode("view")
tmap mode set to interactive viewing
prob_T<-tm_shape(osun)+
  tm_polygons(alpha=0.1)+
  tm_shape(gwr_sf.fixed)+
  tm_dots(col="yhat",
          border.col="gray60",
          border.lwd=1)+
  tm_view(set.zoom.limits = c(8,14))
prob_T

Re-calibrate the model without insignificant variables

In the logistic regression results above, the p-values of distance_to_primary_road and distance_to_secondary_road are 0.4744 and 0.5802 respectively. Both are higher than 0.05, so these two variables are not statistically significant. We therefore re-calibrate the logistic regression model and the GWLR model without them.
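The variable selection can also be expressed in code rather than by retyping the formula. The sketch below uses the p-values of the five distance variables from the global model output and builds the reduced formula with reformulate(); the object names are illustrative, and the remaining five predictors are carried over unchanged because they are all significant in the output above.

```r
# p-values of the distance variables, copied from the global model output.
p_values <- c(
  distance_to_primary_road   = 0.474422,
  distance_to_secondary_road = 0.580230,
  distance_to_tertiary_road  = 3.00e-06,
  distance_to_city           = 1.96e-06,
  distance_to_town           = 8.79e-07
)

# Full predictor set used in the first round of calibration.
predictors <- c(names(p_values), "water_point_population",
                "local_population_1km", "usage_capacity",
                "is_urban", "water_source_clean")

# Variables that fail the 0.05 significance threshold.
not_significant <- names(p_values)[p_values >= 0.05]

# Rebuild the model formula without them.
reduced_formula <- reformulate(setdiff(predictors, not_significant),
                               response = "status")
reduced_formula
```

The resulting formula could be passed to glm(), bw.ggwr() and ggwr.basic() alike, which keeps the three calls consistent.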

Building the new logistic regression model

model_new<-glm(status ~ 
                 distance_to_tertiary_road+
                 distance_to_city+
                 distance_to_town+
                 water_point_population+
                 local_population_1km+
                 usage_capacity+
                 is_urban+
                 water_source_clean,
           data=osun_wp_sf_clean,
           family=binomial(link='logit'))
blr_regress(model_new)
                             Model Overview                              
------------------------------------------------------------------------
Data Set    Resp Var    Obs.    Df. Model    Df. Residual    Convergence 
------------------------------------------------------------------------
  data       status     4756      4755           4746           TRUE     
------------------------------------------------------------------------

                    Response Summary                     
--------------------------------------------------------
Outcome        Frequency        Outcome        Frequency 
--------------------------------------------------------
   0             2114              1             2642    
--------------------------------------------------------

                                 Maximum Likelihood Estimates                                   
-----------------------------------------------------------------------------------------------
               Parameter                    DF    Estimate    Std. Error    z value     Pr(>|z|) 
-----------------------------------------------------------------------------------------------
              (Intercept)                   1      0.3540        0.1055      3.3541       8e-04 
       distance_to_tertiary_road            1      1e-04         0.0000      4.9096      0.0000 
            distance_to_city                1      0.0000        0.0000     -5.2022      0.0000 
            distance_to_town                1      0.0000        0.0000     -5.4660      0.0000 
         water_point_population             1      -5e-04        0.0000    -11.3902      0.0000 
          local_population_1km              1      3e-04         0.0000     19.4069      0.0000 
           usage_capacity1000               1     -0.6206        0.0697     -8.9081      0.0000 
              is_urbanTRUE                  1     -0.2667        0.0747     -3.5690       4e-04 
water_source_cleanProtected Shallow Well    1      0.4947        0.0850      5.8228      0.0000 
   water_source_cleanProtected Spring       1      1.2790        0.4384      2.9174      0.0035 
-----------------------------------------------------------------------------------------------

 Association of Predicted Probabilities and Observed Responses  
---------------------------------------------------------------
% Concordant          0.7349          Somers' D        0.4697   
% Discordant          0.2651          Gamma            0.4697   
% Tied                0.0000          Tau-a            0.2320   
Pairs                5585188          c                0.7349   
---------------------------------------------------------------
blr_confusion_matrix(model_new,cutoff=0.5)
Confusion Matrix and Statistics 

          Reference
Prediction FALSE TRUE
         0  1300  743
         1   814 1899

                Accuracy : 0.6726 
     No Information Rate : 0.4445 

                   Kappa : 0.3348 

McNemars's Test P-Value  : 0.0761 

             Sensitivity : 0.7188 
             Specificity : 0.6149 
          Pos Pred Value : 0.7000 
          Neg Pred Value : 0.6363 
              Prevalence : 0.5555 
          Detection Rate : 0.3993 
    Detection Prevalence : 0.5704 
       Balanced Accuracy : 0.6669 
               Precision : 0.7000 
                  Recall : 0.7188 

        'Positive' Class : 1

Compared with the previous logistic regression model, the accuracy decreases slightly, from 0.6739 to 0.6726.

Building a Geographically Weighted Logistic Regression (GWLR) model

Converting from sf to sp data frame

osun_wp_sp_new<-osun_wp_sf_clean %>%
  select(c(status,
                 distance_to_tertiary_road,
                 distance_to_city,
                 distance_to_town,
                 water_point_population,
                 local_population_1km,
                 usage_capacity,
                 is_urban,
                 water_source_clean)) %>%
  as_Spatial()
osun_wp_sp_new
class       : SpatialPointsDataFrame 
features    : 4756 
extent      : 182502.4, 290751, 340054.1, 450905.3  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=4 +lon_0=8.5 +k=0.99975 +x_0=670553.98 +y_0=0 +a=6378249.145 +rf=293.465 +towgs84=-92,-93,122,0,0,0,0 +units=m +no_defs 
variables   : 9
names       : status, distance_to_tertiary_road, distance_to_city, distance_to_town, water_point_population, local_population_1km, usage_capacity, is_urban, water_source_clean 
min values  :      0,         0.017815121653488, 53.0461399623541, 30.0019777713073,                      0,                    0,           1000,        0,           Borehole 
max values  :      1,          10966.2705628969,  47934.343603562, 44020.6393368124,                  29697,                36118,            300,        1,   Protected Spring 

Building a Fixed Bandwidth GWLR Model

Computing fixed bandwidth

bw.fixed_new<- bw.ggwr(status~
                 distance_to_tertiary_road+
                 distance_to_city+
                 distance_to_town+
                 water_point_population+
                 local_population_1km+
                 usage_capacity+
                 is_urban+
                 water_source_clean,
              data=osun_wp_sp_new,
              family="binomial",
              approach = "AIC",
              kernel = "gaussian",
              adaptive = FALSE,
              longlat = FALSE)
bw.fixed_new
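
To give a feel for what a fixed bandwidth means in practice, the sketch below shows how observation weights decay with distance under a Gaussian kernel. It assumes the commonly documented Gaussian form w = exp(-0.5 * (d / bw)^2) and uses the bandwidth of 2597.255 m that is passed to ggwr.basic() in this exercise. The helper gaussian_weight() and the distances are hypothetical and for illustration only; they are not part of GWmodel.

```r
# Illustrative helper (not a GWmodel function): Gaussian kernel weight
# for an observation at distance d from the regression point.
gaussian_weight <- function(d, bw) {
  exp(-0.5 * (d / bw)^2)
}

bw <- 2597.255                     # fixed bandwidth in metres (CRS units are metres)
d  <- c(0, 1000, bw, 5000, 10000)  # hypothetical distances in metres

weights <- data.frame(distance_m = d,
                      weight     = gaussian_weight(d, bw))
print(weights)
```

Under this kernel, a water point exactly one bandwidth away still carries a weight of exp(-0.5), about 0.61, while points several bandwidths away contribute almost nothing to the local fit.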
gwlr.fixed_new<-ggwr.basic(status~
                 distance_to_tertiary_road+
                 distance_to_city+
                 distance_to_town+
                 water_point_population+
                 local_population_1km+
                 usage_capacity+
                 is_urban+
                 water_source_clean,
              data=osun_wp_sp_new,
              bw=2597.255,
              family="binomial",
              kernel="gaussian",
              adaptive=FALSE,
              longlat=FALSE)
 Iteration    Log-Likelihood
=========================
       0        -2034 
       1        -1772 
       2        -1635 
       3        -1561 
       4        -1530 
       5        -1530 
gwlr.fixed_new
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2022-12-17 17:39:14 
   Call:
   ggwr.basic(formula = status ~ distance_to_tertiary_road + distance_to_city + 
    distance_to_town + water_point_population + local_population_1km + 
    usage_capacity + is_urban + water_source_clean, data = osun_wp_sp_new, 
    bw = 2597.255, family = "binomial", kernel = "gaussian", 
    adaptive = FALSE, longlat = FALSE)

   Dependent (y) variable:  status
   Independent variables:  distance_to_tertiary_road distance_to_city distance_to_town water_point_population local_population_1km usage_capacity is_urban water_source_clean
   Number of data points: 4756
   Used family: binomial
   ***********************************************************************
   *              Results of Generalized linear Regression               *
   ***********************************************************************

Call:
NULL

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-129.368    -1.750     1.074     1.742    34.126  

Coefficients:
                                           Estimate Std. Error z value Pr(>|z|)
Intercept                                 3.540e-01  1.055e-01   3.354 0.000796
distance_to_tertiary_road                 1.001e-04  2.040e-05   4.910 9.13e-07
distance_to_city                         -1.764e-05  3.391e-06  -5.202 1.97e-07
distance_to_town                         -1.544e-05  2.825e-06  -5.466 4.60e-08
water_point_population                   -5.098e-04  4.476e-05 -11.390  < 2e-16
local_population_1km                      3.452e-04  1.779e-05  19.407  < 2e-16
usage_capacity1000                       -6.206e-01  6.966e-02  -8.908  < 2e-16
is_urbanTRUE                             -2.667e-01  7.474e-02  -3.569 0.000358
water_source_cleanProtected Shallow Well  4.947e-01  8.496e-02   5.823 5.79e-09
water_source_cleanProtected Spring        1.279e+00  4.384e-01   2.917 0.003530
                                            
Intercept                                ***
distance_to_tertiary_road                ***
distance_to_city                         ***
distance_to_town                         ***
water_point_population                   ***
local_population_1km                     ***
usage_capacity1000                       ***
is_urbanTRUE                             ***
water_source_cleanProtected Shallow Well ***
water_source_cleanProtected Spring       ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6534.5  on 4755  degrees of freedom
Residual deviance: 5688.9  on 4746  degrees of freedom
AIC: 5708.9

Number of Fisher Scoring iterations: 5


 AICc:  5708.923
 Pseudo R-square value:  0.129406
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Fixed bandwidth: 2597.255 
   Regression points: the same locations as observations are used.
   Distance metric: A distance matrix is specified for this model calibration.

   ************Summary of Generalized GWR coefficient estimates:**********
                                                   Min.     1st Qu.      Median
   Intercept                                -2.7771e+02 -3.9915e+00  2.9346e+00
   distance_to_tertiary_road                -2.0066e-02 -3.6016e-04  8.9385e-05
   distance_to_city                         -3.0931e-02 -5.6273e-04 -1.0359e-04
   distance_to_town                         -3.4702e-03 -4.3133e-04 -1.2398e-04
   water_point_population                   -3.5450e-02 -2.0856e-03 -1.1271e-03
   local_population_1km                     -5.8060e-02  4.0342e-04  1.0001e-03
   usage_capacity1000                       -4.5295e+01 -1.0249e+00 -3.8880e-01
   is_urbanTRUE                             -3.0233e+02 -3.1725e+00 -1.4861e+00
   water_source_cleanProtected.Shallow.Well -1.0470e+02 -4.2423e-01  5.9626e-01
   water_source_cleanProtected.Spring       -7.9160e+02 -5.4086e+00  2.5525e+00
                                                3rd Qu.      Max.
   Intercept                                 1.0668e+01 1102.7459
   distance_to_tertiary_road                 5.3918e-04    0.0140
   distance_to_city                          1.2672e-04    0.0129
   distance_to_town                          2.2159e-04    0.0161
   water_point_population                    1.9400e-04    0.0569
   local_population_1km                      1.6838e-03    0.0293
   usage_capacity1000                        3.5031e-01    5.9152
   is_urbanTRUE                              8.9541e-01  739.6369
   water_source_cleanProtected.Shallow.Well  1.8040e+00   52.4657
   water_source_cleanProtected.Spring        6.5117e+00  152.2614
   ************************Diagnostic information*************************
   Number of data points: 4756 
   GW Deviance: 3051.369 
   AIC : 4499.24 
   AICc : 4759.621 
   Pseudo R-square value:  0.5330355 

   ***********************************************************************
   Program stops at: 2022-12-17 17:39:42 
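
The pseudo R-square values printed above appear to be deviance-based. As a rough check, the sketch below recomputes them from the reported deviances, assuming the formula 1 - deviance / null deviance (the output does not state the formula, so treat this as an assumption), and also compares the two AICc values. All numbers are copied from the output; the function name pseudo_r2() is hypothetical.

```r
# Values copied from the model output above
null_dev   <- 6534.5    # null deviance of the global model
global_dev <- 5688.9    # residual deviance of the global model
gw_dev     <- 3051.369  # GW deviance of the fixed-bandwidth GWLR model

# Assumed definition: deviance-based pseudo R-square
pseudo_r2 <- function(dev, null_dev) 1 - dev / null_dev

global_r2 <- pseudo_r2(global_dev, null_dev)
gw_r2     <- pseudo_r2(gw_dev, null_dev)

# Drop in AICc when moving from the global model to GWLR
aicc_drop <- 5708.923 - 4759.621

print(round(c(global = global_r2, gwlr = gw_r2), 4))
print(aicc_drop)
```

The recomputed values agree with the reported 0.129406 and 0.5330355 to about four decimal places, and the AICc falls by roughly 949, which suggests the local model fits these data considerably better than the global one.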

Model Assessment

Converting SDF into a data frame
gwr.fixed_new <- as.data.frame(gwlr.fixed_new$SDF)
# Label a water point as functional (TRUE) when its fitted probability is at least 0.5
gwr.fixed_new <- gwr.fixed_new %>%
  mutate(most = yhat >= 0.5)
# confusionMatrix() expects factors with matching levels
gwr.fixed_new$y <- as.factor(gwr.fixed_new$y)
gwr.fixed_new$most <- as.factor(gwr.fixed_new$most)
CM <- confusionMatrix(data = gwr.fixed_new$most, reference = gwr.fixed_new$y)
CM
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE  1792  302
     TRUE    322 2340
                                          
               Accuracy : 0.8688          
                 95% CI : (0.8589, 0.8783)
    No Information Rate : 0.5555          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.7341          
                                          
 Mcnemar's Test P-Value : 0.4469          
                                          
            Sensitivity : 0.8477          
            Specificity : 0.8857          
         Pos Pred Value : 0.8558          
         Neg Pred Value : 0.8790          
             Prevalence : 0.4445          
         Detection Rate : 0.3768          
   Detection Prevalence : 0.4403          
      Balanced Accuracy : 0.8667          
                                          
       'Positive' Class : FALSE           
                                          

Compared with the previous GWLR model, the accuracy decreases from 0.8837 to 0.8688.
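
As a sanity check on the figures in the confusion matrix, the headline statistics can be recomputed by hand from the four cell counts. The sketch below is base R only and copies the counts from the output above. Note that caret treats FALSE (non-functional) as the positive class here, so sensitivity refers to correctly identified non-functional water points.

```r
# Cell counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column)
cm <- matrix(c(1792, 322, 302, 2340), nrow = 2,
             dimnames = list(Prediction = c("FALSE", "TRUE"),
                             Reference  = c("FALSE", "TRUE")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["FALSE", "FALSE"] / sum(cm[, "FALSE"])  # positive class is FALSE
specificity <- cm["TRUE", "TRUE"]   / sum(cm[, "TRUE"])
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced), 4))
```

These reproduce the reported accuracy (0.8688), sensitivity (0.8477), specificity (0.8857) and balanced accuracy (0.8667).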

Visualizing gwLR

osun_wp_sf_selected_new <- osun_wp_sf_clean %>%
  select(c(ADM2_EN, ADM2_PCODE, ADM1_EN, ADM1_PCODE, status))
gwr_sf.fixed_new <- cbind(osun_wp_sf_selected_new, gwr.fixed_new)
tmap_mode("view")
tmap mode set to interactive viewing
prob_T<-tm_shape(osun)+
  tm_polygons(alpha=0.1)+
  tm_shape(gwr_sf.fixed_new)+
  tm_dots(col="yhat",
          border.col="gray60",
          border.lwd=1)+
  tm_view(set.zoom.limits = c(8,14))
prob_T